
[Refactor] Merge run_api.py into run.py and overhaul reuse/checkpoint mechanism#1499

Open
mzr1996 wants to merge 2 commits into open-compass:main from mzr1996:update-run

Conversation

Collaborator

@mzr1996 mzr1996 commented Mar 27, 2026

  • Unify run_api.py into run.py via --api-mode flag, removing the separate entry point
  • Redesign reuse system: structured run status (status.json), format-aware prediction file reuse, and three-level --reuse-aux control (all/infer/none) replacing the old boolean flag
  • Standardize checkpoint keys to string type across inference/inference_mt/inference_video for consistent pickle serialization
  • Replace --ignore with --keep-failed (old flag kept as a deprecated alias); re-running failed samples is now the default behavior
  • Add --base-url support to local mode for direct API model inference without config.py
  • Add judge model configuration args (--judge-base-url, --judge-key, --judge-api-nproc, etc.)
  • Improve executor shutdown with proper SIGTERM/SIGKILL worker termination
  • Use relative symlinks for output files to support directory relocation
  • Clean up hardcoded local model paths in config.py
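The new CLI surface described above can be sketched with argparse. This is an illustrative sketch, not the PR's actual code: the flag names come from the PR description, but the help text, defaults, and wiring are assumptions.

```python
import argparse

# Illustrative sketch of the new flags described in the PR summary.
parser = argparse.ArgumentParser()
parser.add_argument('--api-mode', action='store_true',
                    help='Run API models (replaces the old run_api.py entry point)')
parser.add_argument('--reuse-aux', choices=['all', 'infer', 'none'], default='all',
                    help='Three-level control over auxiliary-file reuse')
parser.add_argument('--keep-failed', action='store_true',
                    help='Keep failed samples instead of re-running them')
parser.add_argument('--ignore', action='store_true',
                    help='[Deprecated] alias, see --keep-failed')

args = parser.parse_args(['--reuse-aux', 'infer'])
print(args.reuse_aux)  # infer
```

With no flags given, `--reuse-aux` falls back to `all` and failed samples are retried (`retry_failed = not args.keep_failed`), matching the "retry failed by default" behavior the PR introduces.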

mzr1996 added 2 commits March 27, 2026 12:01
… mechanism

Contributor

Copilot AI left a comment


Pull request overview

This PR consolidates the separate API runner into run.py (--api-mode) and refactors the reuse/checkpoint workflow to support structured run metadata (status.json), format-aware prediction reuse, and more granular auxiliary-file reuse controls.
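The PR does not spell out the status.json layout; a minimal sketch of what a structured run-status file could look like, where every field name is an assumption (the real schema lives in vlmeval/smp/file.py):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical status.json payload; all field names are illustrative.
status = {
    'run_id': '20260327_120100',
    'model': 'InternVL2-8B',
    'dataset': 'MMBench_DEV_EN',
    'state': 'finished',
}

run_dir = Path(tempfile.mkdtemp())
(run_dir / 'status.json').write_text(json.dumps(status, indent=2))

loaded = json.loads((run_dir / 'status.json').read_text())
print(loaded['state'])  # finished
```

Persisting run state as JSON like this is what lets run discovery decide whether a previous run is complete enough to reuse, rather than inferring it from the mere presence of prediction files.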

Changes:

  • Merge run_api.py functionality into run.py behind --api-mode, adding unified judge/inference CLI options and improved executor shutdown/symlink behavior.
  • Redesign reuse logic with status.json, run discovery, prediction completeness checks, and format-converting prediction reuse.
  • Standardize checkpoint/result handling around string keys and add “retry failed by default” controls (--keep-failed, deprecated --ignore).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
vlmeval/smp/file.py Adds run-id helpers, status.json read/write, run discovery, prediction reuse/conversion, and completeness checks.
vlmeval/inference.py Switches to _checkpoint.pkl and string-key checkpoints; adds retry-failed behavior and cleanup.
vlmeval/inference_mt.py Same checkpoint/retry refactor for multi-turn inference; adds cleanup and PREV handling.
vlmeval/inference_video.py Same checkpoint/retry refactor for video inference; uses standardized pred path and cleanup.
vlmeval/inference_api.py Adds retry_failed, improves executor shutdown/worker termination, and creates relocation-safe relative symlinks.
run.py Introduces --api-mode, new reuse/aux semantics, status.json writing, base-url model support, and judge configuration args.
run_api.py Removed; functionality migrated into run.py --api-mode.
vlmeval/__init__.py Filters a specific pkg_resources deprecation warning.
vlmeval/config.py Removes a hardcoded local path for Intern-S1-mini in favor of a repo-style model path.


Comment on lines +548 to +562
judge_kwargs = get_judge_kwargs(dataset_name, dataset.TYPE, args)
judge_signature = judge_kwargs.get('model', '')

if RANK == 0:
    reuse_ctx = prepare_reuse_files(
        pred_root_meta=pred_root_meta,
        eval_id=eval_id,
        model_name=model_name,
        dataset=dataset,
        result_file=result_file,
        reuse=args.reuse,
        reuse_aux=args.reuse_aux,
        retry_failed=not args.keep_failed,
        judge_signature=judge_signature if args.mode != 'infer' else None,
        world_size=WORLD_SIZE,

Copilot AI Mar 27, 2026


judge_signature is currently derived only from judge_kwargs['model'], but prepare_reuse_files(... reuse_aux='all') uses this signature to decide whether to reuse evaluation auxiliary outputs. If judge configuration changes (e.g., base URL, verifier/vllm flags, judge_args affecting scoring), reusing eval aux based on model name alone can mix incompatible evaluation artifacts and yield incorrect results. Consider computing a stable signature from the relevant, non-secret judge settings (e.g., model + api_base + normalized judge_args + verifier/vllm flags) and store/compare that instead.
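The stable signature the reviewer proposes could be sketched as a hash over the non-secret judge settings. The key names (`api_base`, `use_verifier`, `use_vllm`, `judge_args`) are assumptions for illustration, not the repo's actual field names:

```python
import hashlib
import json

def judge_signature_from(judge_kwargs):
    # Sketch of the reviewer's suggestion: hash the relevant, non-secret
    # judge settings instead of using only the model name. The keys
    # selected here are illustrative assumptions.
    relevant = {
        k: judge_kwargs.get(k)
        for k in ('model', 'api_base', 'use_verifier', 'use_vllm', 'judge_args')
        if judge_kwargs.get(k) is not None
    }
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

# Same judge model behind different endpoints now yields different signatures.
sig_a = judge_signature_from({'model': 'gpt-4o', 'api_base': 'https://a.example/v1'})
sig_b = judge_signature_from({'model': 'gpt-4o', 'api_base': 'https://b.example/v1'})
print(sig_a != sig_b)  # True
```

Sorting the keys before serialization keeps the signature stable across dict orderings, and excluding API keys keeps secrets out of on-disk run metadata.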

Comment on lines 269 to 288
# Try to load from the result file.
result_path = Path(cfg.result_file)
if result_path.exists():
    try:
        data = load(str(result_path))
        if isinstance(data, pd.DataFrame):
            if self.retry_failed:
                existing_results = {
                    str(idx): pred
                    for idx, pred in zip(data['index'], data['prediction'])
                    if FAIL_MSG not in str(pred)
                }
            else:
                existing_results = {
                    str(idx): pred
                    for idx, pred in zip(data['index'], data['prediction'])
                }
            results.update(existing_results)
            logger.info(f" [{dataset_name}] Loaded {len(existing_results)} "
                        "results from result file")

Copilot AI Mar 27, 2026


_load_checkpoint() only reuses results from cfg.result_file when load() returns a pandas DataFrame. If PRED_FORMAT=json, dump() writes a list of records and load() returns a list, so this reuse path is skipped (or hits the exception handler) and the pipeline will re-infer already-finished samples. Consider supporting list-of-records / dict formats here (similar to the prediction-table normalization used in vlmeval.smp.file).
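The normalization the reviewer suggests can be sketched without pandas: accept either a list of record dicts (what `load()` returns for PRED_FORMAT=json) or a DataFrame/dict-like object with `index` and `prediction` columns, and produce one dict keyed by string index. The helper name is hypothetical:

```python
def predictions_to_dict(data):
    # Hypothetical normalizer: list-of-records (JSON format) or a
    # DataFrame/column-dict both collapse to {str(index): prediction}.
    if isinstance(data, list):
        return {
            str(r['index']): r['prediction'] for r in data
            if isinstance(r, dict) and 'index' in r and 'prediction' in r
        }
    return {str(k): v for k, v in zip(data['index'], data['prediction'])}

records = [{'index': 0, 'prediction': 'A'}, {'index': 1, 'prediction': 'B'}]
columns = {'index': [0, 1], 'prediction': ['A', 'B']}
print(predictions_to_dict(records) == predictions_to_dict(columns))  # True
```

Because `zip(data['index'], data['prediction'])` works for both a DataFrame and a plain column dict, one branch covers both tabular shapes; only the JSON records shape needs special handling.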


def write_run_status(run_dir, status):
    run_dir = Path(run_dir)
    run_dir.mkdir(exist_ok=True)

Copilot AI Mar 27, 2026


write_run_status() calls run_dir.mkdir(exist_ok=True) without parents=True. If a caller passes a nested run directory whose parents haven't been created yet, this will raise and prevent status.json from being written. Using mkdir(parents=True, exist_ok=True) would make this helper robust and consistent with other directory creation in the repo.

Suggested change
run_dir.mkdir(exist_ok=True)
run_dir.mkdir(parents=True, exist_ok=True)

frame = _prediction_table(src_file)
if frame is None:
    logger.warning(
        f'Can not convert prediction file {src_file} to target format {dst_suffix}, reuse skipped.'

Copilot AI Mar 27, 2026


Spelling/grammar: "Can not convert ..." should be "Cannot convert ..." in this warning message.

Suggested change
f'Can not convert prediction file {src_file} to target format {dst_suffix}, reuse skipped.'
f'Cannot convert prediction file {src_file} to target format {dst_suffix}, reuse skipped.'

Comment on lines +421 to +423
if args.ignore:
    logger.warning('[Deprecated] the `--ignore` flag is deprecated since it is '
                   'the default behavior, use `--keep-failed` to disable it.')

Copilot AI Mar 27, 2026


--ignore is described as a deprecated alias, but it currently only logs a warning and does not change behavior. As a result, users passing --ignore will still retry failed samples (since retry_failed=not args.keep_failed). If the intent is to keep backward compatibility, map args.ignore to args.keep_failed=True (and adjust the help text accordingly).

Suggested change
if args.ignore:
    logger.warning('[Deprecated] the `--ignore` flag is deprecated since it is '
                   'the default behavior, use `--keep-failed` to disable it.')
if getattr(args, "ignore", False):
    logger.warning('[Deprecated] the `--ignore` flag is deprecated since it is '
                   'the default behavior, use `--keep-failed` to disable it.')
    # Maintain backward compatibility: `--ignore` behaves like `--keep-failed`.
    # This ensures users passing `--ignore` will not retry failed samples.
    if hasattr(args, "keep_failed"):
        args.keep_failed = True

if rank == 0:
    data = load(result_file)
    # breakpoint()
    results = {k: v for k, v in zip(data['index'], data['prediction'])}

Copilot AI Mar 27, 2026


load(result_file) can return a list of records when PRED_FORMAT=json (because dump() serializes DataFrames to to_dict('records') for JSON). This code assumes data is a DataFrame/dict with data['index'] and data['prediction'], which will raise at runtime when resuming from a JSON result file. Consider normalizing data to a DataFrame (or reusing the shared prediction-table parsing logic in vlmeval.smp.file._prediction_table) before zipping index/prediction.

Suggested change
results = {k: v for k, v in zip(data['index'], data['prediction'])}
# Normalize loaded prediction table to a dict[index] -> prediction
if isinstance(data, list):
    # JSON format: list of record dicts, e.g. [{'index': ..., 'prediction': ...}, ...]
    results = {
        rec['index']: rec['prediction']
        for rec in data
        if isinstance(rec, dict) and 'index' in rec and 'prediction' in rec
    }
else:
    # DataFrame-like or dict-like with 'index' and 'prediction' columns/keys
    try:
        indices = data['index']
        predictions = data['prediction']
        results = {k: v for k, v in zip(indices, predictions)}
    except (TypeError, KeyError):
        # Unexpected structure; fall back to empty results to avoid crashing
        results = {}

if osp.exists(result_file):
    if rank == 0:
        data = load(result_file)
        results = {k: v for k, v in zip(data['index'], data['prediction'])}

Copilot AI Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load(result_file) can return a list of records when PRED_FORMAT=json (since dump() serializes DataFrames as records for JSON). This block assumes data supports data['index']/data['prediction'] like a DataFrame/dict, which will fail on resume with JSON output. Normalize data to a DataFrame (or reuse the shared prediction parsing helper) before building results.

Suggested change
results = {k: v for k, v in zip(data['index'], data['prediction'])}
# `load` may return a dict/DataFrame-like object or a list of records (for JSON "records" format).
if isinstance(data, list):
    # Expect a list of {'index': ..., 'prediction': ...} records.
    results = {rec['index']: rec['prediction'] for rec in data}
else:
    results = {k: v for k, v in zip(data['index'], data['prediction'])}

Comment on lines +249 to +251
results = {k: v for k, v in zip(data['index'], data['prediction'])}
results = {k: v for k, v in results.items() if FAIL_MSG not in str(v)}
if len(results) == len(data):

Copilot AI Mar 27, 2026


When PRED_FORMAT=json, dump() writes DataFrames as a list of record dicts, so load(result_file) returns a list. This resume/retry block assumes data['index'] and data['prediction'] exist, which will raise for JSON result files. Consider normalizing data to a DataFrame (or using the same prediction-table parsing logic used by reuse checks) before extracting index/prediction.

Suggested change
results = {k: v for k, v in zip(data['index'], data['prediction'])}
results = {k: v for k, v in results.items() if FAIL_MSG not in str(v)}
if len(results) == len(data):
# Normalize loaded prediction table to support both dict/DataFrame-like
# objects (with 'index' and 'prediction' keys/columns) and JSON
# record-style lists (list of dicts with those keys).
if isinstance(data, list):
    indices = [item.get('index') for item in data]
    preds = [item.get('prediction') for item in data]
else:
    indices = data['index']
    preds = data['prediction']
results = {k: v for k, v in zip(indices, preds)}
results = {k: v for k, v in results.items() if FAIL_MSG not in str(v)}
if len(results) == len(indices):

temperature=args.temperature,
top_k=args.top_k,
top_p=args.top_p,
repetition_penalty=args.repetition_penalty,

Copilot AI Mar 27, 2026


build_model_from_base_url() (and the corresponding CLI args) supports repetition_penalty but does not expose presence_penalty, which was previously supported by the removed run_api.py runner and is used by several API/model implementations in this repo. If the intent is full feature parity when using --base-url, consider adding a --presence-penalty arg and including it in model_args (similar to repetition_penalty).

Suggested change
repetition_penalty=args.repetition_penalty,
repetition_penalty=args.repetition_penalty,
presence_penalty=args.presence_penalty,
